Fix S3 filenames that include a hash #121

Open · cyreb7 wants to merge 2 commits into master

Conversation

@cyreb7 commented Nov 14, 2024

Ran into this when we switched to ZipStream recently: S3 filenames containing a hash (#) were being truncated. The Laravel Storage facade handles hashes without issue.

Possible breaking changes I can think of with this PR:

  1. Using strlen("s3://{$this->getBucket()}/") to split off the key (sketched below) means that if anyone has a malformed S3 bucket string, this would fail in unexpected ways.
  2. If anyone was relying on the old parse_url() behavior, this does change how S3 file paths are parsed.
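
For illustration, a minimal sketch of the parsing difference (the bucket and key here are made up, and the exact code in the PR may differ):

```php
<?php
// Minimal sketch of the parsing difference; bucket and key are made up.

$bucket = 'my-bucket';
$uri    = "s3://{$bucket}/reports/2024#final.pdf";

// parse_url() treats everything after "#" as a URL fragment, so the key is truncated:
$truncated = ltrim(parse_url($uri, PHP_URL_PATH), '/');
// => "reports/2024"

// Stripping the known "s3://bucket/" prefix keeps the full key, hash included:
$key = substr($uri, strlen("s3://{$bucket}/"));
// => "reports/2024#final.pdf"
```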

@jszobody (Member)

See #110

@cyreb7 (Author) commented Nov 14, 2024

Oh gosh, sorry, I didn't see that PR. I don't mind if you want to close this, but I'd argue that while AWS doesn't recommend hashes in object keys, it does support them. Plus at least two people have run into this in the wild and were motivated enough to open a PR to fix it, so it's clearly a real-world issue.

Did you ever look into using the AWS parser? If you did, then we could at least bug AWS support instead of you 😆

@jszobody (Member)

You make a good point. I'm willing to reconsider this. Did you see how Laravel's storage facade handles it? It might at least be worth making sure we're taking the same approach.

@cyreb7 (Author) commented Nov 18, 2024

Digging into how Laravel handles it: Laravel uses the Flysystem library, and Flysystem stores the S3 bucket and path separately, which sidesteps the issue of deconstructing paths entirely.

Looking at the ZipStream code again, I realize calculateFilesize() is the only place we actually need to deconstruct the URI, so if we can avoid using headObject(['Bucket' => ..., 'Key' => ...]) there, we can fix the issue. When I have time later I'll try using the stream wrapper like the other functions already do; the AWS docs say it supports filesize().
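
For what it's worth, here's a minimal standalone sketch of that idea. registerStreamWrapper() and filesize() over an s3:// path are standard AWS SDK for PHP features; the bucket, key, and client options are made up, and whether the wrapper keeps the hash intact is what we'd need to verify:

```php
<?php
// Standalone sketch: size an S3 object via the SDK stream wrapper instead of
// headObject(). Bucket, key, and client options are made up.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'region'  => 'us-east-1',
    'version' => 'latest',
]);

// Registers the s3:// wrapper so native filesystem functions work on S3 paths.
$s3->registerStreamWrapper();

// No bucket/key deconstruction needed here; whether the wrapper preserves the
// "#" in the key is exactly the behavior we'd want to test.
$bytes = filesize('s3://my-bucket/reports/2024#final.pdf');
```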

@jszobody (Member)

Hmm. filesize($this->getReadableStream()) would be interesting to test out. That already sets up the stream wrapper.

Keep in mind there are two loops over our entire queue of files: first to calculate the sizes and send out a Content-Length header, then a second time to read in the file contents. I'm always paranoid about the performance of that first loop. I don't know whether setting up the stream wrapper for potentially a ton of files just to get a filesize will be any different from a direct headObject call, but it's something we'll want to test. I know we sometimes zip thousands of files at once.

@cyreb7 (Author) commented Nov 23, 2024

Sorry for not reporting back earlier; I've been busy. I have now investigated using a stream and made an interesting discovery: the AWS stream wrapper already fetches the file size every time a new read stream is created, so using the stream to get the filesize is essentially free!

I set up a benchmark locally that creates 1,000 small 100-character random text files on S3, then zips them to a local file and times it. Running it a couple of times gives:

Using $this->getS3Client()->headObject()

Average    : ~250ms
Peak Memory: ~72 MB

Using $this->getReadableStream()->getSize()

Average    : ~190ms
Peak Memory: ~72 MB
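
For reference, the benchmark was along these lines. This is a reconstruction rather than the exact script: the disk and bucket names are made up, and the Zip::create()/saveTo() calls follow the package README as I recall, so the exact names may differ.

```php
<?php
// Reconstruction of the benchmark described above; run inside a Laravel app
// (e.g. an artisan command or test case). Disk and bucket names are made up.

use Illuminate\Support\Facades\Storage;
use Illuminate\Support\Str;
use Zip; // laravel-zipstream facade alias

// Seed 1,000 small random files, each with a "#" in the key.
$uris = [];
for ($i = 0; $i < 1000; $i++) {
    $key = "benchmark/file-{$i}#test.txt";
    Storage::disk('s3')->put($key, Str::random(100));
    $uris[] = "s3://my-bucket/{$key}";
}

// Zip them all to a local file and time it.
$start = hrtime(true);
Zip::create('benchmark.zip', $uris)->saveTo('/tmp');

printf(
    "Elapsed: %.0f ms, peak memory: %.1f MB\n",
    (hrtime(true) - $start) / 1e6,
    memory_get_peak_usage(true) / 1048576
);
```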
